Analysis of Unsupervised Clustering by Crossing Minimization
Abstract
In [3] it was demonstrated for the first time that crossing minimization of bipartite graphs can be used to perform unsupervised clustering. In this paper, we present a detailed analysis of the bipartite graph model used to perform unsupervised clustering in [1, 2, 3]. We also discuss the effect of data discretization, followed by simulation results demonstrating the noise immunity of the technique.

INTRODUCTION

Our ability and capacity to generate, record and store multidimensional, apparently unstructured data is increasing rapidly in the natural, life and social sciences. Finding a valuable piece of information in a mountain of data is a hard problem. When you know that you are looking for a needle, finding it in the haystack is easy compared to when you do not know what you should be looking for, such as the existence of hidden patterns and relationships in data. The formal study of such problems is called data mining. Data mining problems for large data sets can only be attempted using automatic means.

One such interesting data mining problem is clustering. Clustering is the process of partitioning a set of data or objects into a set of meaningful sub-classes, or clusters. There are several techniques and methods for clustering, such as hierarchical, partition, graph-theoretic, agglomerative, artificial neural networks and simulated annealing. In [3] it was shown for the first time that a graph drawing technique, namely crossing minimization, can be successfully used for unsupervised clustering. The plain crossing minimization technique (CMH) has limited application because its results deteriorate in the presence of noise. Therefore, a recursive technique (CRA_CMH) was also proposed in [3], resulting in high noise immunity. Both techniques, as well as meta techniques [1] using [4], have since then been used to successfully mine biological and agricultural data.
NOMENCLATURE

Knowledge Representation, Information Theoretic Approaches, Statistical and Numerical Aspects.

2.0 DEFINITIONS, AND MODEL FORMULATION

The input data for a clustering problem is typically given in one of two forms [11]:

• Data matrix (or object-by-variable structure): an n × p matrix in which each of the n objects has p variables, also called measurements or attributes. Usually n >> p.
• Similarity (or dissimilarity) matrix (or object-by-object structure): an n × n symmetric matrix S that contains the pair-wise similarity (or dissimilarity), usually computed from the p variables, for all pairs of the n objects.

Let the given data matrix consist of interval-scaled variables. Using the given data matrix, a similarity matrix S is generated using Pearson's correlation, defined as:
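The standard form of Pearson's correlation between objects i and j, assumed here because the excerpt does not show the authors' own notation, is:

$$
s_{ij} = \frac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{p} (x_{jk} - \bar{x}_j)^2}}
$$

where x_ik is the value of variable k for object i and x̄_i is the mean of row i. The following is a minimal sketch of how the n × n similarity matrix S could be computed from an n × p data matrix, assuming Python with NumPy. The function name `pearson_similarity` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def pearson_similarity(data):
    """Return the n x n Pearson correlation matrix of the rows of an n x p data matrix.

    Assumes no row is constant; a constant row has zero variance and
    would cause a division by zero.
    """
    data = np.asarray(data, dtype=float)
    # Subtract each object's mean over its p variables.
    centered = data - data.mean(axis=1, keepdims=True)
    # Root of the sum of squared deviations for each object.
    norms = np.sqrt((centered ** 2).sum(axis=1))
    # Entry (i, j) is the normalized inner product of rows i and j.
    return (centered @ centered.T) / np.outer(norms, norms)

# Hypothetical toy data matrix: n = 3 objects, p = 4 variables.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [4.0, 3.0, 2.0, 1.0]])
S = pearson_similarity(X)
```

In this toy example the first two objects are perfectly correlated and the third is perfectly anti-correlated with both. The result matches NumPy's built-in `np.corrcoef(X)`, which treats each row as one variable and could be used directly instead.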
Similar resources
Data mining using the crossing minimization paradigm
Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The data recorded is not perfect, as noise gets introduced in it from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden inf...
High-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
Comparison Between Unsupervised and Supervise Fuzzy Clustering Method in Interactive Mode to Obtain the Best Result for Extract Subtle Patterns from Seismic Facies Maps
Pattern recognition on seismic data is a useful technique for generating seismic facies maps that capture changes in the geological depositional setting. Seismic facies analysis can be performed using the supervised and unsupervised pattern recognition methods. Each of these methods has its own advantages and disadvantages. In this paper, we compared and evaluated the capability of two unsuperv...
Robust Unsupervised Clustering Using Generalized Annealing M-estimator
A new robust clustering algorithm, called generalized annealing M-estimator (GAM-estimator), is proposed. Initialized with multiple seeds, the GAM-estimator converges to several optimal cluster centers. Neither knowledge about the number of clusters nor scale is needed. The global optimal solution of clustering is achieved by minimization of an objective function. The algorithm is applied to un...
Unsupervised and supervised data classification via nonsmooth and global optimization
We examine various methods for data clustering and data classification that are based on the minimization of the so-called cluster function and its modifications. These functions are nonsmooth and nonconvex. We use Discrete Gradient methods for their local minimization. We consider also a combination of this method with the cutting angle method for global minimization. We present and discuss re...